Wigwams Dataset Creation GUI
created by: Krzysztof Polanski
contact: k.t.polanski@warwick.ac.uk

1. THE PURPOSE OF THIS GUI

The Wigwams dataset creation GUI helps the researcher create a dataset input file that is correctly formatted for use in Wigwams.

To run the GUI, set Matlab's current folder to the dataset creation GUI folder and type Wigwams_dataset_GUI into the console.

2. ADDING A TIME COURSE

In order to add a time course data set to the structure, the following information needs to be provided:
	-	the name of the time course experiment, in "condition name"
	-	the time points covered by the experiment, in "time points" (the GUI accepts both vector notation and writing the time points out in a space-delimited manner)
	-	the unit of time the time points are measured in, in "unit"
	-	a tab-delimited expression file with the genes' expression profiles for the time course, imported via the "Import Expression File" button (the names of the genes are expected by the program in the first column of the expression file)
	-	the starting position of the expression values of the given time course in the expression profile file, in "starting position" (to obtain this, open the expression profile file in Excel and note the column-row combination of the first expression value; most often this will be B2)
	-	the number of replicates, in "number of replicates"
	-	the way the replicates are sorted in the expression file, via the appropriate checkbox (replicates in order, or replicates split and all the values for the particular time point in order)
	-	the manner in which to condense information from numerous replicates into one numerical value (mean, median or 2d Gaussian smoothing) in "data format"
		*	If 2d Gaussian smoothing is chosen, three extra values need to be provided - the number of adjacent time points to take into account, and the variance of the kernel for the replicate dimension and for the time point dimension. The kernel treats the m replicates of an n-point time series as an m by n matrix with a distance of 1 between any adjacent fields. Prior to the smoothing, each time point has its replicate values sorted. The procedure converts the m by n matrix into a 1 by n vector, accounting for adjacent time points and extreme time point replicate values with weights dependent on the variances provided by the user.
	-	an optional list of differentially expressed genes (DEGs), one gene name per line, imported via the "Import DEG List" button (if DEG lists are to be used, one needs to be provided for every time course data set - if a DEG list is missing for any time course data set, DEG consideration is disabled for the other time course data sets as well)
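
The "starting position" and replicate ordering options can be illustrated with a short sketch. It is written in Python purely for illustration and is not the GUI's Matlab code. The function names cell_to_indices and condense_replicates are hypothetical, and the reading of the two replicate layouts (all time points of replicate 1, then replicate 2, and so on, versus all replicates of time point 1, then time point 2, and so on) is an assumption based on the checkbox descriptions above. Only the mean and median options are covered; the 2d Gaussian smoothing is not reproduced here.

```python
import re
from statistics import mean, median


def cell_to_indices(cell):
    """Convert an Excel-style reference such as 'B2' to zero-based (row, column)."""
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", cell.strip())
    if match is None:
        raise ValueError(f"not a cell reference: {cell!r}")
    letters, digits = match.groups()
    col = 0
    for ch in letters.upper():
        # Column letters behave like a base-26 number with digits A=1 .. Z=26.
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, col - 1


def condense_replicates(values, n_timepoints, n_replicates,
                        replicates_in_order=True, how="mean"):
    """Collapse one gene's replicate measurements to one value per time point.

    values is the flat list of expression values read from one row of the
    expression file, starting at the "starting position".
    """
    if len(values) != n_timepoints * n_replicates:
        raise ValueError("unexpected number of expression values")
    if replicates_in_order:
        # Assumed layout: rep1 t1..tn, rep2 t1..tn, ...
        per_timepoint = [
            [values[r * n_timepoints + t] for r in range(n_replicates)]
            for t in range(n_timepoints)
        ]
    else:
        # Assumed layout: t1 rep1..repm, t2 rep1..repm, ...
        per_timepoint = [
            values[t * n_replicates:(t + 1) * n_replicates]
            for t in range(n_timepoints)
        ]
    reducer = mean if how == "mean" else median
    return [reducer(point) for point in per_timepoint]
```

For example, under these assumptions a starting position of B2 maps to the second row and second column of the file, and a gene with two replicates over three time points is reduced to a three-value profile.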

3. SAVING A CONDITION

Once all the information listed in section 2 has been provided for a given time course data set, press the "Save Condition" button. The script will import the data, format it accordingly and store it. Once the saving procedure is complete, the "number of conditions" field at the top will increment by 1 and the "Save Condition" button will turn green.

4. SAVING THE DATASET

Prior to saving the dataset, the user needs to provide information on the expression unit of the time series in the "expression unit" field. Once that is provided, the dataset can be saved via the "Save Dataset" button. The time course data sets will be trimmed down to genes that are featured in all of them. If DEG information is provided, the dataset will be trimmed down to genes that are differentially expressed in at least one time course data set.
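
The gene trimming described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the GUI's Matlab code: the function name trim_genes is hypothetical, gene names are assumed to match exactly between files, and the output keeps the gene order of the first time course data set, which the GUI may not do.

```python
def trim_genes(gene_lists, deg_lists=None):
    """Return the genes kept in the final dataset.

    gene_lists: one list of gene names per time course data set.
    deg_lists:  optional, one collection of DEG names per time course data set.
    """
    # Keep only genes that are featured in every time course data set.
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    if deg_lists is not None:
        # Keep only genes differentially expressed in at least one data set.
        deg_anywhere = set().union(*deg_lists)
        common &= deg_anywhere
    # Preserve the order of the first data set (an assumption of this sketch).
    return [gene for gene in gene_lists[0] if gene in common]
```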

If DEG information is provided, each time course data set will have the expression profiles of its non-differentially expressed genes randomly permuted, and the resulting structure will be saved as a second dataset structure with _shuffled appended to the name. It is recommended to use this structure for Wigwams.
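
One possible reading of the shuffling step is sketched below in Python, for illustration only. The text does not say exactly what is permuted; this sketch assumes that the values within each non-differentially expressed gene's profile are reordered within a single time course data set, while profiles of differentially expressed genes are left untouched. The function name shuffle_non_deg_profiles and the seed parameter are hypothetical.

```python
import random


def shuffle_non_deg_profiles(profiles, is_deg, seed=None):
    """Shuffle the profiles of non-DE genes within one time course data set.

    profiles: one list of expression values per gene.
    is_deg:   one boolean per gene, True if the gene is differentially
              expressed in this time course data set.
    """
    rng = random.Random(seed)
    shuffled = []
    for profile, deg in zip(profiles, is_deg):
        row = list(profile)  # copy, so the input is not modified
        if not deg:
            # Assumption: permute the order of the values within the profile.
            rng.shuffle(row)
        shuffled.append(row)
    return shuffled
```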

If DEG information is provided, the list of seed genes will consist of the genes that are differentially expressed in at least two time course data sets. Otherwise, every gene will be eligible to be a seed gene.
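
The seed gene rule can be expressed compactly. The following Python sketch is illustrative and not the GUI's Matlab code; the function name seed_genes is hypothetical, and DEG information is assumed to be available as one row of true/false flags per gene, with one flag per time course data set.

```python
def seed_genes(genes, degs=None, min_conditions=2):
    """Return the genes eligible to be seed genes.

    genes: list of gene names.
    degs:  optional list of rows, one per gene, each holding one flag per
           time course data set (truthy if the gene is DE there).
    """
    if degs is None:
        # Without DEG information every gene is eligible.
        return list(genes)
    return [
        gene
        for gene, flags in zip(genes, degs)
        if sum(1 for flag in flags if flag) >= min_conditions
    ]
```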

5. DATASET MATLAB STRUCTURE FORMAT

Saving the dataset produces a Matlab structure with the following fields:
	-	genes - the names of the genes featured in the dataset; the row numbers in dataset.submatrix and dataset.degs correspond to the appropriate gene
	-	conditions - the conditions corresponding to each of the time course data sets stored within, as input via "condition name"; the ordering of dataset.timepoints, dataset.timescale and the columns of dataset.degs and dataset.submatrix correspond to the appropriate condition
	-	timepoints - the time points covered by each of the experiments, as entered via "time points"; the ordering of the columns of dataset.submatrix corresponds to the appropriate time point of the appropriate conditions
	-	timescale - the unit of time the time points were measured in, as input via "unit"
	-	submatrix - the standardised (on a per-gene, per-data set basis) expression of the genes in dataset.genes for each of the conditions in dataset.conditions measured at time points from dataset.timepoints
	-	degs - if DEG information is provided, this matrix stores information on which of the conditions from dataset.conditions each gene from dataset.genes is differentially expressed in
	-	unit - the unit the expression is measured in, as entered via "expression unit"
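
The per-gene, per-data set standardisation behind the submatrix field can be sketched as follows. This Python sketch is illustrative only and is not the GUI's Matlab code. It assumes that "standardised" means subtracting the mean and dividing by the sample standard deviation, that each condition occupies a contiguous block of columns, and that every block has at least two time points; the function names and the handling of constant profiles are likewise assumptions.

```python
from statistics import mean, stdev


def standardise_block(profile):
    """Standardise one gene's profile within one time course data set."""
    mu = mean(profile)
    sd = stdev(profile)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        # Constant profile: return zeros instead of dividing by zero
        # (a choice made for this sketch).
        return [0.0 for _ in profile]
    return [(value - mu) / sd for value in profile]


def standardise_gene(row, block_sizes):
    """Standardise one row of the expression matrix block by block.

    row:         all expression values of one gene, conditions side by side.
    block_sizes: number of time points in each condition, in column order.
    """
    out = []
    start = 0
    for size in block_sizes:
        out.extend(standardise_block(row[start:start + size]))
        start += size
    return out
```

With this layout, a gene measured at three time points in each of two conditions is standardised separately within columns 1 to 3 and columns 4 to 6.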